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CUSTOMER PROFILING APPARATUS FOR CONDUCTING CUSTOMER 
BEHAVIOR PATTERN ANALYSIS, AND METHOD FOR COMPARING 
CUSTOMER BEHAVIOR PATTERNS 



FIELD OF THE INVENTION 
The present invention pertains to the field of data analysis. More 
particularly, this invention relates to methods and apparatus for profiling 
customer behavior. 

BACKGROUND OF THE INVENTION 
In recent years, the ability to profile customer behavior and to 
conduct associated pattern analysis has become desirable. For example, the 
need for improved techniques in these areas has recently been identified in the 
fast-growing fields of telecommunications and electronic commerce, or 
e-commerce. However, existing tools and techniques do not deliver sufficient 
capabilities. 

For example, one technique involves creating customer behavior 
profiles from very large collections of transaction data. More particularly, a 
database is formed with transaction data, the database is batch updated, and 
the database is structured for fast online queries and summaries by department 
managers. For example, in telecommunications applications, hundreds of 
millions of call records may have to be processed daily in order to create and 
incrementally update customer calling behavior profiles. These customer calling 
behavior profiles are contained within data warehouse systems. Hence, such 
data warehouses can contain enormous amounts of data which can make 
searching for one customer profile unwieldy and time consuming. 

Therefore, there exists a present need for improved techniques for 
profiling customer behavior. 
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SUMMARY OF THE INVENTION 
A profiling apparatus includes a data-warehouse and an OLAP based, 
scalable and flexible profiling engine. The apparatus is configured to define 
profiles using probability distributions, and compute the profiles using OLAP 
5 operations on multidimensional and multilevel data cubes, A formalism is used 
for defining and computing various similarity measures on patterns derived from 
behavior profiles. One profiling engine has been implemented using a 
commercial data warehouse server and a multidimensional OLAP server. 
Experience has revealed the simplicity and powerfulness of OLAP-based 

10 solutions to probability-oriented, scalable profiling and multilevel, 
multidimensional pattern analysis and comparison. 

According to one aspect, a customer profiling apparatus is provided 
for conducting customer behavior pattern analysis. The apparatus includes 
processing circuitry, a data warehouse, a profiling engine, and at least one 

15 computer program. The processing circuitry is operative to process customer 
records. The data warehouse is coupled with the processing circuitry and is 
configured to store the processed customer records. The profiling engine 
communicates with the data warehouse and is operative to build and update 
customer behavior profiles by mining the customer records that flow into the 

20 data warehouse. The at least one computer program is performed by the 

profiling engine, and is operative to define behavior profiles as data cubes and 
derive similarity measures on patterns extracted from the behavior profiles. 

According to another aspect, a profiling apparatus is provided. The 
profiling apparatus includes a data warehouse, a profiling engine, and a 

25 computer application program. The data warehouse is configured to store 

customer records. The profiling engine communicates with the data warehouse 
and is operative to generate customer behavior profiles from the customer 
records within the data warehouse. The computer application program is 
implemented on the profiling engine and is operative to represent behavior 

30 profiles as patterns and derive similarity measures of the patterns usable to 
profile customer behavior. 
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According to yet another aspect, a method is provided for comparing 
customer behavior patterns. The method includes: providing call data in the 
form of call data records to a data warehouse; loading the call data records into 
an OLAP server; generating a profile-snapshot cube accommodating multiple 
customers; in combination with generating the profile-snapshot cube, 
generating a profile cube for the same set of customers from the data 
warehouse; updating the profile cube by merging the profile cube with the 
profile-snapshot cube; and storing the updated profile cube in the data 
warehouse. 

DESCRIPTION OF THE DRAWINGS 

Preferred embodiments of the invention are described below with 
reference to the following accompanying drawings depicting examples 
embodying the best mode for practicing the invention. 

Fig. 1 is a schematic block diagram illustrating an exemplary hardware 
environment used to implement one aspect of the present invention. 

Fig. 2 is a conceptual diagram illustrating the derivation of probability 
cubes from volume cubes in the form of persistent data warehouse objects. 

Fig. 3 is a schematic block diagram illustrating a profiling engine 
architecture according to the present invention including a data warehouse and 
an OLAP server. 

Fig. 4 is a conceptual diagram illustrating the manner in which a 
profile-snapshot cube (PCS) is populated using binning techniques. 

Fig. 5 is a logic flow diagram illustrating the steps performed when 
profiling customer behavior according to the present invention. 

DETAILED DESCRIPTION OF THE INVENTION 
This disclosure of the invention is submitted in furtherance of the 
constitutional purposes of the U.S. Patent Laws "to promote the progress of 
science and useful arts". U.S. Constitution, Article 1, Section 8. 
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Profiling customer behavior has become increasingly important for 
many applications such as fraud detection, targeted marketing and promotion. 
Customer behavior profiles are created from very large collections of transaction 
data. For example, in telecommunication applications, hundreds of millions of 
call records may have to be processed daily in order to create and incrementally 
update customer calling behavior profiles. 

As a result, an apparatus for profiling customer behavior has been 
developed and is identified in Figured 1 by reference numeral 10. Profiling 
apparatus 10 includes a data-warehouse server 12 and an OLAP server 14. 
Such OLAP profiling engine 16 comprises an OLAP-based scalable and flexible 
profiling engine 16 that can be used in many applications. 

As used herein, profiling engine refers to a specialized processor used 
to implement customer behavior analysis. According to one implementation, 
profiling engine 16 is an OLAP-based specialized processor for performing 
customer behavior analysis. The profiling of customer behavior aims at 
extracting patterns of customer activities from transactional data, and using 
such patterns to provide guidelines for service provisioning, trend analysis, 
abnormal behavior detection, and other related customer patterns of activity. 
Profiling engine 16 is capable of building and updating customer calling behavior 
profiles incrementally by mining call records that flow into a data-warehouse on 
a daily basis, deriving calling patterns from profiles, and analyzing and 
comparing the similarity of calling patterns. According to one implementation, 
an OLAP server is used as a scalable computation engine in order to support 
profile computation, maintenance and utilization. According to such one 
implementation, profiling engine 1 6 builds and incrementally updates customer 
calling behavior profiles by mining call records flowing into a data-warehouse on 
a daily basis, using an OLAP server. Secondly, profiling engine 16 maintains 
profiles by staging data between a data-warehouse and an OLAP 
multidimensional database. Thirdly, profiling engine 16 derives multilevel and 
multidimensional customer calling patterns from profiles for analysis purposes. 
Finally, profiling engine 16 compares the similarity of customer calling patterns 
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from volume and probability distribution points of view, and generates multilevel 
and multidimensional similarity measures, with such measures being used in 
various applications; for example, for use in fraud detection. Further according 
to the one implementation, profiling engine 16 is built on top of an Oracle-8 
based data-warehouse and an Oracle Express, and an OLAP server. Call data 
records, customer behavior profiles, and other reference data are stored within 
the warehouse. The call data records are fed in dally and are dumped to archive 
after use. In this manner, the OLAP server is used as a computation engine for 
creating and updating profiles, deriving calling patterns from profiles, as well as 
analyzing and comparing calling patterns. 

According to one implementation, profiling apparatus 10 is provided 
within a data mining system 18 that is implemented over a communication 
platform 20 such as a network 22. According to one construction, 
communication platform 20 comprises a client/server network 22 configured to 
process applications designed for client/server architectures. Specific 
implementations for network 22 include the Internet, or the World Wide Web 
(WWW). Accordingly, client/server network 22 includes a client 24 as well as 
servers 12 and 14. Client 24 includes a central processing unit (CPU) 26, 
memory 28, and a graphical user interface (GUI) 30. Similarly, server 12 
includes central processing unit (CPU), or processing circuitry, 32 and memory 
34, while server 14 includes central processing unit (CPU) 36 and memory 38. 
OLAP server 14 also includes probability distributions 40 and data cubes 42. 
Behavior profiles 44 are provided within data cubes 42. 

As shown in Figure 1 , behavior profiles 44 are first defined by 
probability distributions, then are computed using OLAP operations on 
multidimensional and multilevel data cubes 42. A formalism is provided for 
defining and computing various similarity measures on patterns derived from 
these behavior profiles. 

According to one implementation, profiling engine 16 is realized using 
commercial data warehouse server 12 and multidimensional OLAP server 14. 
Experience has revealed the simplicity and powerfulness of OLAP-based 
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solutions to probability-oriented, scalable profiling and multilevel, 
multidimensional pattern analysis and comparison. 

I. Overview 

5 The profiling of customer behavior is carried out with the aim of 

extracting patterns of customer activities from transactional data. Extracted 
patterns are then used to provide guidelines when making business decisions 
such as when determining service provisioning, performing trend analysis, and 
detecting abnormal behavior. Such profiling and analysis has become 
_ 10 increasingly important in a variety of application domains, such as fraud 
P detection, personalized marketing and commercial promotion. Such profiling and 

T analysis has also given rise to the need for a scalable infrastructure in order to 

7; support filtering, data mining and continuous analysis of massive amounts of 

transaction data. The provision of such an infrastructure has been realized with 
15 implementation of the data-warehousing and OLAP technology identified by data 
=p: mining system 18 of Figure 1. 

In order to realize profiling apparatus 10 of data mining system 18, 
P the construction and application of customer behavior profiles is obtained from 

telephone call data for the purpose of conducting fraud detection. Call fraud is a 
20 serious problem that generates a significant amount of revenue loss each year 
for telecommunication service providers. One way of carrying out fraud 
detection is to discover abnormal calling behavior. Typically, a customer's calling 
behavior is represented by the composition and periodic appearance of the 
following features. One feature is to identify the callees, or the numbers that 
25 have been called. Another feature is to identify when calls are made, or the 
time windows in which the calls have been made. Yet another feature is to 
identify how long specific calls last, or the duration of specific calls. 

Behavior-based fraud detection may be further classified into the 
following two categories. The first is "threshold based fraud detection", such as 
30 a call determined to be suspicious if its duration is greater than 24 hours. A 
second is where a call is determined to be suspicious if its duration is greater 
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than 4 hours, and the call is made in the evening. The second is "pattern based 
fraud detection", such as where a specific caller, as identified by phone number, 
is identified as being suspicious if the caller's calling pattern is similar to a 
previously known fraudulent pattern, based upon comparison with the 
5 previously known fraudulent pattern. 

The profiling of callers' behavior is significant for both of the above- 
described kinds of fraud detection. In the case of the first category of threshold- 
based fraud detection, without having information about personalized calling 
behavior, only generalized thresholds may be set, such as considering generally 

10 that a call is determined to be suspicious if it lasts over 24 hours. With the 
availability of a customer's calling behavior profile, personalized, rather than 
generalized thresholds can be set, so that, for example, calls by a specific 
individual identified as "John" which last for 4 hours are considered usual, but 
calls by a specific individual identified as "Jane" which last for 2 hours are 

15 considered unusual. Thus personalized or group-based thresholds can be used 
in order to provide more precise fraud detection than is available using more 
generalized thresholds. Similarly, pattern based fraud detection is based on 
profiles that carry out pattern matching. Each new customer's calling behavior 
is profiled and compared against known fraudulent profiles. Customer profiles 

20 are also useful for other summary information oriented applications. 

In order to create and update effective customer behavior profiles, 
hundreds of millions of call records are required to be processed nearly every 
day. In order to carry this out more efficiently, a scalable and maintainable 
framework has been developed to support such profiling. More particularly, 

25 profiling engine 16, according to one construction, is built on top of an Oracle-8 
based telecommunication data-warehouse 46 and includes Oracle Express. 
Oracle Express is a multidimensional OLAP server 14 that implements OnLine 
Analytical Processing (OLAP). OLAP server 14 implements decision support 
software 48 that allows a user to quickly analyze information that is summarized 

30 into multidimensional views and hierarchies. 
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According to OLAP, profiles and calling patterns are represented as 
multidimensional cubes 42. Profiling engine 16 is capable of building and 
updating customer calling behavior profiles incrementally by mining call records 
that flow into the data-warehouse daily, deriving "calling patterns" from profiles, 
and analyzing and comparing the similarity of ''calling patterns". There exists 
practical value in using an OLAP server 14 as a scalable computation engine, or 
profiling engine, 1 6 in order to support profile computation, maintenance and 
utilization. Profiles and "calling patterns" are based on the probability 
distribution of call volumes. OLAP can be used in order to compute these 
probability distributions as described below in further detail, 

OLAP technology within OLAP server 14 is used to analyze data 
maintained in data-warehouses such as data-warehouse 46, More particularly, 
OLAP tools 52 are used to support large-scale data mining. One significant 
feature provided by Applicant's invention is the ability to use OLAP based 
customer behavior profiling and pattern analysis. 

As will be described below in greater detail, behavior profiling is 
carried out using probability distributions. Also described below in greater detail 
is one suitable profiling engine architecture 54 for profiling engine 16. 
Furthermore, exemplary algorithms are described below for computing profile 
and calling pattern cubes. 

II. Probabilitv Distribution Based Profiling Using OLAP 
A. Behavior Profiling with Probability Distribution 

In order to implement customer behavior profiling, a user implementing 
the profiling need to decide which features, or dimensions, are particularly 
relevant to the profiling being considered. For example, when implementing a 
customer telephone calling behavior profiling application, exemplary features of 
Interest could be specific telephone-numbers, call volume {/,e.,the number of 
calls made), call duration, time of day, and day of week for a customer's 
outgoing and incoming calls. 
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Secondly, the granularity of each feature needs to be selected. Thus, 
the time of day feature may be represented by the following time-bins: 
'morning'; 'afternoon'; 'evening'; and 'night'. The duration feature may be 
represented by: 'short' (shorter than 20 minutes); 'medium' (20 to 60 minutes); 
5 or 'long' (longer than 60 minutes). Finally, a profiling interval {e.g., 3 months) is 
chosen over which the customer profiles will be constructed, and a periodicity 
of the profiles [e.g., weekly) is also chosen. 

According to one exemplary application, a customer's profile is a 
weekly summarization of the customer's calling behavior during the profiling 
10 interval. Based on the profiled information, calling patterns of individual 

customers may be derived. As one example, the following three kinds of calling 
patterns can be considered: 

1 ) "Fixed-value based calling pattern" represents a customer's 
calling behavior with fixed values showing the customer's "average" behavior. 
15 For example, the following calling pattern from number A to number B says that 
on the average calls are short. 



morning 


afternoon 


evening 


medium 


short 


long 



20 

Table 1 - A fixed value based calling pattern 
2) "Volume based calling pattern" summarizes a customer's calling 
behavior by counting the number of calls of different duration in different time- 
25 bins. For example, the following calling pattern from number A to number B says 
that there were 350 short calls carried out in the mornings of the profiling 
period. 





morning 


afternoon 


evening 


short 


350 


30 


80 


medium 


100 


40 


200 


long 


50 


30 


120 



Table 2 - A volume based calling pattern 
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3) "Probability distribution based calling pattern" represents a 
customer's calling behavior with probability distributions. For example, the 
following calling pattern from number A to number B says that 10% of the calls 
in the morning were long, 20% were medium, and 70% were short. 





morning 


afternoon 


evening 


short 


0.7 


0.3 


0.2 


medium 


0.2 


0.4 


0.5 


long 


0.1 


0.3 


0.3 



Table 3 - A probability distribution based calling pattern 

According to Applicant's invention, volume-based calling patterns are 
used as an intermediate step when deriving probability-based calling patterns. 

Probability distribution based calling patterns provide more fine-grained 
representation of dynamic behavior than fixed value based ones. Probability 
distribution based calling patterns also allow calling patterns corresponding to 
different lengths of profiling interval to be compared. For example, for the case 
of a threshold-based fraud detection application, a long duration call may be 
noticed before reaching an absolute threshold, and the call can be monitored 
with increasing levels of scrutiny as the probability of fraud becomes higher and 
higher. Probability distribution based calling patterns also provide more details 
of individual behavior, which has not been available to the same degree using 
fixed-value based calling patterns. 

B. Computing Probability Distribution Based Calling Patterns Using OLAP 

When using OLAP, profiles and calling patterns are represented as 
cubes 42 (see Fig. 1). A cube has a set of underlying dimensions, and each cell 
of the cube is identified by one value from each of these dimensions. The set of 
values of a dimension D, called the domain of D, may be limited (by the OLAP 
limit operation) to a subset. A sub-cube (slice or dice) can be derived from a 
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cube C by dimensioning C by a subset of its dimensions, and/or by limiting the 
value sets of these dimensions. 

As mentioned above, the profile of a customer is a weel<ly 
summarization of the customer's activities in the profiling period. In order to be 
5 more efficient, according to one prototype profiling system a group is made of 
the information for profiling multiple customers' calling behavior into a single 
profile cube with dimensions <duration, time, dow, callee, caller>, where dow 
stands for day_of_week (e„g., Monday,..., Sunday), callee and caller are calling 
and called phone numbers. The value of a cell in a profiling cube measures the 
^ 10 volume; Ae., number of calls, made in the corresponding duration-bin, time-bin in 
y a day, and day of week, during the profiling period. In this way, a profile cube 

CP records multiple customers' outgoing and incoming calls week by week. From 

jj such a multi-customer profile cube, calling pattern cubes of individual customers 

'7 may be derived. They have similar dimensions as the profile cubes except that a 

15 calling pattern cube for outgoing calls is not dimensioned by caller, and a calling 
1| pattern cube for incoming calls is not dimensioned by callee, because such 

cubes each pertain to a single customer. 
0 In order to represent such calling behavior at multiple levels, 

dimensions dow, time and duration are defined as hierarchical dimensions, along 
20 which the calling pattern cubes can be rolled up. A hierarchical dimension D 
contains values at different levels of abstraction. Associated with D there are: 
a) a level-dimension describing the levels of £); b) a relation mapping each value 
of D to the appropriate level; and c) a relation mapping each value of D to its 
parent value (the value at the Immediate upper level). The above relationships 
25 provide sufficient information to roll up cube C along dimension D) that is, to 
calculate the total of cube data at the upper levels using the corresponding 
lower-level data. A cube may be rolled up along multiple underlying dimensions. 
For example, dimensions duration, time and dow are defined in the following 
way: 

30 1. Dimension dow is associated with dimension dowLevel 

with values 'dd', 'ww', 'week', dow has values 'MON',..., 'SUN' at 
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the lowest level (dd level), 'wkday', 'wkend' at a higher level (ww 
level), and 'week' at the top level {'week' level). 

2. Dimension time is associated with dimension timeLevel 
with values 'time_.bin' and 'time_air. time has values 'night', 

5 'morning', 'afternoon' and 'evening' at 'time_bin' level (bottom-level), 

and 'allday' at the 'time_all' level (top-level). 

3. Dimension duration is associated with dimension 
durLevel with values 'dur_bin' and 'dur^all'. duration has values 
'short', 'medium', 'long' at 'dur_bin' level (bottom-level), and 'all' at 

^ 10 'dur_air level (top-level). 

yj For example, volume based calling pattern cube "CB.o" represents the 

d1 outgoing calling behavior of a customer. In Oracle Express, that is defined in 

i;j the following way: 

Define CB.o variable int < sparse < duration time dow cailee> > inplace 
15 Similarly, volume based cube "CB.d" represents incoming calling behavior. In 
:rj Oracle Express, that is defined in the following way: 

Define CB.d variable int < sparse < duration time dow caller > > inplace 
The cell values of the above-described cubes are the number of calls 
falling into the given 'slot' of time, duration, day of week, etc. When generated, 
20 cubes "CB.o" and "CB.d" are rolled up along dimensions duration, time and 
dow. Therefore, 

CB.o (duration 'short', time 'morning', dow 'MON') 
measures the number of short-duration calls this specific customer made to each 
callee (dimensioned by callee) on Monday mornings during the profiling interval. 
25 Similarly, CB.o (duration 'all', time 'aliday', dow 'week') measures the number of calls 

this customer made to each callee (total calls dimensioned by callee) during the 
profiling interval. 

Cubes representing probability distribution based calling patterns are 
derived from volume based pattern cubes. Depending on application 
30 requirements, various cubes may be derived. For example, cube "P_CB.o" for a 
customer represents the dimensioned probability distribution of outgoing calls 
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over all the outgoing calls made by this customer, and is derived from cube 

"CB,o" in the following way. 

Define P_CB.o formula decimal < duration time dow callee> 
EQ (CB.o/totaKCB.o (duration 'all', 'allday', dow 'week'))) 
5 where total (CB.o (duration 'all', 'allday', dow 'week')) is the total calls this customer 

made to ail callees. Remember that cube "CB.o" has already been rolled up; 

hence, we can use its top-level value). The value of a cell is the above 

probability corresponding to the underlying dimension values. 

As another example, cube 'T1_CB.o" is dimensioned by duration and 
10 callee, and represents the probability distribution of a customer's outgoing calls 

over his total calls to the corresponding callee, and is also derived from cube 

"CB.o" as specified in the following manner. 

Define P1_CB.o formula decimal < duration time dow callee > 
EQ (CB.o/total (CB.o (duration 'all', 'allday', dow 'week'), callee)) 
15 where total (CB.o (duration 'all', 'allday', dow 'week') is the total calls this customer 

made to each callee, dimensioned by callee. The value of a cell is the above 

probability corresponding to the underlying dimension values. Calling pattern 

cubes for incoming calls can be defined similarly. 

20 C. Calling Pattern Similarity Comparison 

Calling pattern comparison is important for such applications as fraud 
detection. Since the similarity of customer behavior can be represented from 
different angles, we compare calling patterns that have been derived from 
customer calling behavior profiles, rather than by comparing profiles directly. 

25 For example, some calling patterns might be similar in the volume of calls to the 
same set of callees. Others might be similar in the time of these calls, such as 
those that occur late at night. Our objective, therefore, is to enable the 
comparison of calling patterns along multiple dimensions and at multiple levels 
of the dimension hierarchies. 

30 Given two input calling pattern cubes, say C; and the output of 

the comparison is a similarity cube, say C^, rather than a single value. The 
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similarity cube C; can be dimensioned differently from cubes C; and C2 which 
are being compared. Each cell of represents the similarity of a pair of 
corresponding sub-cubes, or slices or dices, of C; and C2. 

In order to support such cube similarity comparison, the following 
should be provided: 

1 . The mapping from a cell of to a pair of corresponding 
sub-cubes of C; and C2. 

2. The algebraic structure for summarizing cell-wise 
comparison results of a pair of sub-cubes to a single similarity 
measure to be stored in the corresponding cell of C^. We have 
introduced the following two approaches. One treats a sub-cube as a 
bag, and summarizes cell-wise comparison results based on bag 
overlap. The other treats a sub-cube as a vector, and summarizes 
cell-wise comparison results based on vector distance. 

In order to compare customers' behavior from multiple aspects (e.g., 
from volume or probability distribution perspective), different calling patterns 
may need to be compared. Hence, multiple similarity cubes may need to be 
generated. The choice of which calling pattern cubes to use and which 
similarity cubes to generate will be application specific. 

D. Bag Overlap based Cube Similarity 

Bag overlap based similarity comparison is applicable to a special kind 
of cubes called "count-cubes". The cell-values of a count-cube are non-negative 
integers for measuring counts. A volume based calling pattern cube, as defined 
in this application, is a count-cube. 

The similarity of two bags may be defined on the basis of their 
overlap. Let A and B be bags, (Ja{x) be the count of element x in bag A) that 

is, the number of copies of x in A, and e,n and O be bag containment, 
intersection and union. Briefly speaking, they are defined as illustrated below. 
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A±B<:>i\fx e A)aA(x) < oi(x) 

where gib e stands for the "greatest lower bound" of A and B under the bag- 
containment relationship, that is the largest bag C such thatCc A and 
CcB. 

AOB<:>luha(A,B) 
where lub e stands for the "least upper bound" of A and B under the bag- 
containment relationship, that is the smallest bag C such that^ c C and 
Be: C . With these notions, we define the similarity of bag A and B, ij.(A,B). as 



M(A,B)= ' '"^ Where fJ.,,--^ 

Following is a step-by-step explanation of how to apply these formalisms to 
compare the similarity of count-cubes. 

First we view a count-cube as a bag in the following way. An 
element of the bag is identified by the list of dimension-values underlying a cell 
of the cube, and the count of that element {/.e., the copies of that element in 
the bag) is represented by the cell value. A cell value 0 means the non- 
existence of that element in the bag. Thus, for example, 

CB.o{duration 'short', time 'morning', dow 'MON', callee '4089194444') = 5 
stands for a bag element with count 5. 

Next, since two count-cubes, say C; and C^, can be viewed as bags, 
they can be compared using the above bag-overlap based similarity measure, 
denoted by Ai:<C;, ^ [0,1]. 

Further, any sub-cube, or slice or dice, of a count-cube is also a 
count-cube, and thus can be mapped to a bag. This allows the corresponding 
sub-cubes of two count-cubes to be compared based on bag-overlap. 
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Given two count-cubes C^[D] and C2/D7 with the same underlying 
dimensions D={D^^ DJ, comparing their similarity and putting the 
comparison results into a similarity cube CJDJ, are handled as follows: 

1 . Define the similarity cube CJDJ, with each cell of Q 

corresponding to a sub-cube of C; {and C2). This requires us to 

provide a mapping from a cube-cell e dimensioned by to a sub-cube 

C^dimensioned by ^ D , expressed as 

The following two general types of cell-to-subcube mappings have 
been introduced: 

(a) Type A ("projection") cell-to-subcube mapping: making 
Ds^D . As a result, each cell of C5 is mapped to a 

sub -cube of C; (or C2). As an example, assume that C; 
and C2 are dimensioned by <duration, timer dow, 
callee> , but by <duration, time dow> . Then each 
cell of Cs corresponds to a sub-cube of C; (and C2) based 
on the same values as e on dimension duration^ time and 
dow but all values on dimension cafiee. 

(b) Type B (''change level") cell-to-subcube mapping: having 
dimension d such that a value of d^ corresponds to 
multiple values of d. In other words, there exist a one- 
to-many relationship between d^ and d. For example, if 
C; and C2 are dimensioned by <duration, time, dow, 
callee> but by KdurLevel, timeLevel, dowLevel> , 
and there exist many-to-one mappings from duration, 
time and dow to durLevel^ timeLevel and dowLevel, 
respectively, then each cell of corresponds to a sub- 
cube of C; (and C2). These sub-cubes are based on 
multiple values on dimension duration, time and dow. 
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2. For each cell e of C^, identify the pair of corresponding 
sub-cubes of C; and C^, say C; and C2, then the value of e is the 
similarity of and C2 compared based on bag-overlap, i.e., 

value(e)- //(ci^Cs) 



E, Vector Distance based Cube Similarity 

Conceptually, cubes may also be viewed as vectors, and the cubes 
may be compared for similarity based on vector distance. 

Given vectors <f;,.», t^> and ^n> ' their distance df, may be 

measured by the following formula: 



and their similarity can be easily derived from their distance. The cosine based 
vector similarity measure shown below has also been tried. 



However, experience has revealed that it is less sensitive for comparing vectors 
representing call volumes. 

Following is an explanation of how to view cubes as vectors and 
compare their similarity based on vector distance. 

A cube is viewed as a vector of field/value pairs in the following way. 
A "field" of the vector is identified by the dimension-values underlying a cube 
ceil, and the "value" of that "field" is the cell value. For example: 

CB.o(duration 'short', time 'morning', dow 'MON', callee '4089194444') = 5 
stands for one vector element with field identified by < duration 'short', time 
'morning', dow 'MON', callee '40891 94444' > and value equals to 5. 
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Viewing cubes, say C; and C2, as vectors, allows us to compare their 
similarity V(C^, C2) based on vector distance, as 
V:<Ci,Ci >-> [0,1] 

Further, any sub-cube, or slice or dice, of a cube can also be mapped to a 
5 vector, and that allows each pair of corresponding sub-cubes of two cubes to be 
compared based on vector-distance. 

Therefore, given two cubes C^[D] and CzlD] with the same underlying 
dimensions D = /£>,,..., DJ, comparing their similarity and putting the 
comparison results into a similarity cube CJDJ, are handled as follows: 
10 1 . Define the similarity cube CJDJ, with each cell of 

corresponding to a sub-cube of C; (and C2), using Type A or Type B 
cell-to-subcube mapping described above. 

2. For each cell e of C^, select the pair of corresponding 
sub-cubes of C; and C2, say C; and C2, then the value of e is the 
15 similarity of C; and C2 compared based on vector-distance, i.e., 

value{e) = V(c^, C2) 
In the following sections several similarity cubes generated by 
comparing calling pattern cubes are shown. 

20 F. Volume based itemized Similarity Cubes 

Decimal cubes "ISIM.I" and ^'ISIM.d" measure calling pattern similarity 
on outgoing and incoming call volumes, respectively. Using the Oracle Express 
language, they are defined as: 

define ISIM.o variable decimal < sparse < duration time dow> > inplace 
25 define ISIM.d variable decimal < sparse < duration time dGw> > inplace 

Cube ''ISIM.o'' will be observed for more details. Cube "ISIM.o" is the 
similarity cube generated from comparing two volume based calling pattern 
cubes "CB.o" and "CB2.o" (which is defined in the same way as "CB.o"). Cube 
"ISIM.o" is dimensioned by Kduration, time, dow> , but cubes "CB.o" and 
30 "CB2.0" are dimensioned by <duration, time, dow callee> , As a result, each 
cell e of cube "ISIM.o" represents the similarity of a pair of corresponding sub- 
Case 10991149-1 
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cubes of cubes "CB.o" and "CB2.o". These sub-cubes are based on the same 
dimension values as e on duration, time and dow, but on all the values of the 
callee dimension. The sub-cube comparison is based on "vector-distance" and 
the "type A cell-to-subcube mapping", as shown in the following algorithm. 
5 Sub-Cube Comparison Algorithm 

algorithm genlSIM 

{ 

- declare int variables a, b, d 

10 - limit duration, time, dow to all their values 

- reduce the size of dimension callee by limiting its values to those 
that make either CB.o > O or CB2.o > O 

15 - compare {In CB.o and CB2.o) the calling volumes for each callee and 

generate a vector-based similarity measure corresponding to each 
combination of duration, time and dow. 

for each dow 
20 { 

for each time 

{ 

for each duration 
{ 

25 d = sqrt (total {(CB.o - CB2.o) * (CB.o - CB2,o))) 

a = sqrt (total (CB.o * CB.o)) 
b = sqrt (total {CB2.0 * CB2.o)) 
conver vector distance to similarity and 
takes some kind of average measure, e,g., 

30 ISlM.o = 1 - d/ (a + b) 

} 

} 

} 

35 Note that the cubes being compared are already rolled up along 

hierarchical dimension duration, time and dow. This allows cube "ISlM.o" to 
present dimensioned similarity measures at multiple levels. For example, in the 
following slice of cube "ISlM.o", the following cell is shown: 
ISIM.o (duration 'ail', time 'morning', dow 'week') 

40 The above cell represents the volume-based calling pattern similarity for 

mornings, based on weekly summarization, across all days and all durations. 
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ISIM.o 

DOW: week 

ISIM.O— 

DURATION 



TIME 


all 


Short 


Medium 


Long 


allday 


0.90 


0.87 


0.64 


0.89 


Night 


0.78 


0.78 


1.00 


1.00 


Morning 


0.93 


0.93 


0.81 


0.88 


Afternoon 


0.87 


0.83 


0.75 


1.00 


Evening 


0.91 


0.88 


0.59 


1.00 



In the following slice of cube "ISIM.o", the following cell is shown: 
15 ISIM.o (duration 'short', time 'morning', dow 'FRi') 

The above cell represents volunne-based calling pattern sinnilarity on short calls 
on Friday nnornings. 

ISIM.o 

DOW: FRI 

20 ISIM.O 

DURATION 



TIME 


all 


Short 


Medium 


Long 


allday 


0.95 


0.92 


0.55 


0.87 


Night 


0.82 


0.82 


1.00 


1.00 


Morning 


0.95 


0.95 


0.87 


0.87 


Afternoon 


0.95 


0.91 


0.66 


1.00 


Evening 


0.95 


0.91 


0.49 


1.00 



30 

G. Volume based Similarity Cubes 

Similarity cubes "VSIM.o" (for outgoing calls) and "VSIM.d" (or 
incoming calls) also measure volume-based calling pattern similarity but are 
dimensioned by levels rather than values of duration, time and dow. They are 
35 defined in Oracle Express as, 

define VSIM.o variable decimal <durLeveI, timeLevel, dowLevel> inplace 
define VSIM.d variable decimal <durLevel, timeLevel, dowLevel> inplace 
Let us consider cube "VSIM.o" for more details. It is the similarity 
cube generated by comparing two volume-based pattern cubes "CB.o" and 
40 "CB2.0". Dimensioned by <durLevel, timeLevel, dowLevel> , the cell of cube 
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"VSIM.o" corresponding to dimensions values durLevel = timeLevel = 
and dowLeveJ = L^, represents the similarity of a pair of corresponding sub- 
cubes of cubes "CB.o" and "CB2.o". These sub-cubes are based on the 
dimension values of duration^, time and dow at level L^, and L^, respectively, 
5 and on all the values of callee dimension. In the following algorithm, "type B 

cell-to-subcube mapping" is used, and the sub-cube comparison is based on bag- 
overlap. We can see that more frequently called numbers have higher 
contribution to pattern comparison, but calls to unmatched destinations have no 
contribution to the similarity of calling patterns. 
10 Tvpe B "Cell-to-Subcube" Mapping Algorithm 

algorithm genVSIM .o 

{ 

- declare int variables a, d 

15 - limit dimensions duration, time, dow to all their values 

- reduce the size of dimensionID by limiting its values to those that make 
either CB.o >0 or CB2.0 >0 

20 - compare (in CB.o and CB2.o) the calling volumes for each callee 

then generate bag-intersection based similarity on calling volume at 
each level of dimensions duration, time and dow 

for each dowLevel 
25 { 

limit dow to values at the current dowLevel 
for each timeLevel 

{ 

limit time to values at the current timeLevel 
30 for each durLevel 

{ 

limit time to values at the current timeLevel 
for each durLevel 

{ 

35 limit duration to values at the current durLevel 

a = total (CB.o) 
b = total (CB2.o)_ 
d = total (min(CB.o, CB2.o)) 
VSIM.o = (d/a * d/b)/2 

40 } 

} 

} 
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As examples, two cells in the cube "VSIM.o" instance are shown 



below. 



1 . Ceil VSIM.o {durLevel 'dur_bin', timeLevel 'time^bin', dowLevel 
'dd') represents the similarity of a corresponding pair of volume-based 

5 sub-cubes of "CB.o" and "CB2.o". These sub-cubes are based on 

values 'short', 'medium', 'long' of dimension duration, values 'night', 
'morning', 'afternoon', 'evening' of dimension time, values 'MON' to 
'SUN' of dimension dow, and all values of dimension ca/lee. The 
value of this cell is the bag-overlap based summarization of cell-wise 

10 comparison of the above pair of sub-cubes. 

2. Cell VSIM.o (durLevel 'dur_all', timeLevel 'time.all', dowLevel 
'week') represents the similarity of such a pair of sub-cubes of "CB.o" 
and "CB2.0" that are based on high-level values of dimensions 
duration, time and dow, and all values of dimension callee, 

15 VSIM.o 

DOWLEVEL: week 



20 



VSIM.O 

DURLEVEL 

TIMELEVEL dur_ali dur^bin 



time.all 0.91 0.88 

time bin 0.91 0.87 



DOWLEVEL: WW 

25 VSIM.O 

DURLEVEL 

TIMELEVEL dur_all dur.bin 

time_ail 0.91 0.88 

30 time.bin 0.91 0.87 



DOWLEVEL: dd 

35 VSIM.O- 



DURLEVEL 

TIMELEVEL dur_all dur_bin 



time_all 0.87 0.84 

40 time_bin 0.85 0.82 
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In general, the degree of similarity may be higher at higher levels of 
the above dimensions. This is because at time_bin and dur_bin levels, two calls 
to the same destination at different times of the day and with different duration 
may be considered somewhat different. However, at top-levels of these 
dimensions, such a difference is removed since the top-level value of time 
dimension covers all day, and the top-level value of duration dimension covers 
any length of call. 

H. Probability based Similarity Cubes 

The similarity of volume-based calling patterns is meaningful only 
when they cover the same time span. This limitation can be eliminated in 
measuring the similarity of probability-based calling patterns. This is especially 
useful in comparing a preset calling pattern with an ongoing pattern in real-time. 

The following cubes measure the similarity of probability-based calling 

patterns: 

define PSIM.o variable decimal <durLevel, timeLevel, dowLevel> inpiace 
define PSIM.d variable decimal <durLevel, timeLevel, dowLevel> inpiace 
define PI SIM. o variable decimal <durLevel, timeLevel, dowLevel> inpiace 
define PI SIM. d variable decimal <durLevel, timeLevel, dowLevel> inpiace 

Cubes "PSIM.o" and "PI SIM. o" are related to outgoing calls, while 
cubes "PSIM.d" and "PI SIM. d" are related to incoming calls. Cube "PSIM.o" is 
used to measure the similarity of calling patterns like cube "P_CB.o" defined 
before, that take into account the dimensioned probability distribution of a 
customer's calls over all the calls the customer made. However, cube 
"PI SIM. o" is used to measure the similarity of calling patterns like cube 
"P1„CB.o", that take into account the dimensioned probability distribution of a 
customer's calls dimensioned by callees. The analogous difference exists 
between cubes "PSIM.d" and "PI SIM, d". 

Like cubes "VSIM.o", "PSIM.o" and "PI SIM. o" are also dimensioned 
by KdurLevel, timeLevel, dowLevel> . Each of these similarity cubes is 
calculated by comparing two probability-based calling pattern cubes based on 
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vector-distance using type B cell-to-subcube mapping. The following algorithm 
is used to generate cube "PI SIM. o". It takes two calling pattern cubes, 
"P1_CB,o" and 'T1„CB2.o" (defined in the same way as cube "Pl.CB.o") as 
input {since cubes "Pl.CB.o" and "P1_CB2.o" are "views" of cubes "CB.o" and 
''CB2.0", the latter can also be considered input cubes). 

Cube "PISIM.o" Generation Algorithm 

algorithm genPISIM 

{ 

- declare int variables a, b, d 

- limit duration, time, dow to all their values 



- reduce the size of dimensionlD by limiting its values to those that make 
either CB.o > O or CB2.o > O (P1_CB.o and P1_CB2.o are derived from 

15 CB.o and CB2,o) 

- compare pair-wisely (in P1_CB.o and P1_CB2.o) to each callee the calling 
probability over the calls to that callee, then generate a vector-based 
similarity measure corresponding to each level of duration, time and dow 

20 dimensions. 

for each dowLevel 

{ 

limit dow to values at the current dowLevel 
25 for each timeLevel 

{ 

limit time to values at the current timeLevel for each durLevel 

{ 

limit duration to values at the current durLevel) 
30 d = sqrt (total {(Pl.CB.o - P1_CB2.o) * (Pl.CB.o - P1_CB2.o))) 

a = sqrt (total (P1_CB.o * P1_CB.o)) 
b - sqrt (total (P1_CB2.o * P1_CB2.o)) 
conver vector distance to similarity and 
take some kind of average measure, e,g,, 
35 P1S1M.O = 1 - d/(a + b) 

} 

} 



} 



An instance of 'PI SIM. o' is illustrated below. Let us look at the 
following two cells that are based on the same dimension values as the cells 
shown in the above cube "VSIM.o'' examples. 
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1 . Cell PI SIM. o (durLevel 'dur_bin', timeLevel 'time.bin', dowLevel 
'dd') represents the similarity of a corresponding pair of probability- 
based sub-cubes of "PI^CB.o" and 'T1„CB2,o". These sub-cubes are 
based on low-level values of dimension duration, time, dow and all 
values of dimension callee. The value of the above cell is the vector- 
based summarization of cell-wise comparisons of the above pair of 
sub-cubes. 

2. Cell PI SIM. o (durLevel 'dur.all', timeLevel 'time.all', dowLevel 
'week') represents the similarity of such a pair of sub-cubes of 
"Pl.CB.o'' and "P1_CB2.o" that are based on high-level values of 
dimension duration, time and dow, and all values of dimension callee, 

PISIM.o 
DOWLEVEL: week 



TIMELEVEL 



P1SIM.O 

DURLEVEL 

dur all dur_bin 



time_all 
time bin 



DOWLEVEL: WW 



1.00 

0.94 



0.71 
0.70 



TIMELEVEL 



PI SIM. 0 

DURLEVEL 

dur all dur_bin 



time_all 
time bin 



0.95 
0.92 



0.72 
0.71 



30 



35 



40 



DOWLEVEL: dd 



TIMELEVEL 



time_all 
time bin 



PI SIM. 0 

DURLEVEL 

dur ail dur_bin 



0.77 
0.73 



0.63 
0.61 
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The size of each profile cube may be controlled by partitioning the 
customers represented in a profile cube by area; and by limiting the profiling 
period. 

Multiple calling pattern cubes may be generated to represent a 
customer's calling behavior from different aspects. According to one design, 
several calling pattern cubes representing ''probability-based information" are 
actually derived from intermediate calling pattern cubes representing "volume- 
based information". 

Let us consider a volume-based cube V for a single customer derived 
from the above profile cube by totaling outgoing calls over days of week. V 
holds the counts of calls during the profiling period dimensioned by <time, 
duration^ callee> , where dimension time has values 'morning', 'evening', etc; 
duration has values 'short', 'long', etc; and dimension callee contains the called 
phone numbers. A cell in the cube is identified by one value from each of these 
dimensions. The value in a ceil identified by duration = 'short', time = 
'morning', cai/ee = '4089194444', is the number of calls made from this 
customer to '4089194444' in the mornings {say, 8-12 a.m.) that are 'short' 
(say, less than 20 min), during the profiling period. 

From cube V the following different probability cubes (and others) may 
be generated: 

1 . Cp„for the prior probability of time-bin of calls with 
respect to each callee, that is dimensioned by <time, callee>, and 
indicates the percentage of calls made in 'morning', 'afternoon', 
'evening' and 'night', respectively. 

2. Cpfor the conditional probability of call duration-bin given 
time-bin of calls with respect to each callee, that is dimensioned by 
<time, duration, callee>, indicates the percentage of calls that are 
'long', 'medium' and 'short', respectively, given the time-bin. 

3. C^o^for the probabilistic consequence of the above; i.e., 
the probability of calls in every cell crossing dimensioned by <time, 
duration, callee> over the total calls. 
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In fact, all the above probability cubes, C^^;, Cp, and C^^^, can be 
derived from cube V using OLAP operations. In the Oracle Express OLAP 
language, these are expressed as 

Cp^. = totaKV, time, callee) / total(V, callee) 
5 cp^ (V/ CpJ / total fV, callee) 

C^^^ = V/totaKV, callee) 

In the above expressions, "totar is a typical OLAP operation on 
cubes with numerical cell values. While ""total(V)"' returns the total of the cell 
values of V, "totaliV, callee) returns such a total dimensioned by ''callee, 

10 totaKV, time, callee)"' returns such a total dimensioned by "time'' and "callee". 
In fact, a dimensioned "total" represents a cube. The arithmetic operations on 
cubes, such as 7' used above, are computed cell-wise. 

One example illustrating the derivation of probability cubes from 
volume cubes is shown in Figure 2. According to Figure 2, the derivation of 

15 probability cubes 60, 62 and 64 is illustrated from a volume cube 66. More 
particularly, volume cube 66 represents a volume-based cube, V, from which 
cubes 60, 62 and 64 are derived. Cube 60 represents prior probability cube Cp,,. 
Cube 62 represents conditional probability cube Cp. Cube 64 represents 
probabilistic consequence cube Ccon- 

20 With the above mechanism it is only necessary to make volume 

cubes persistent data-warehouse objects. In the other worlds, only the volume 
based Information needs to be profiled, and calling patterns, either based on 
volume or probability, can be derived. This shows the simplicity, and yet 
powerfulness, of using OLAP to handle probability based profiling. 

25 

III. Architecture of the Profiling Engine 

The profiling engine 16 (of Fig. 1) provides the following major 

functions: 

1 . Building and incrementally updating customer calling 
30 behavior profiles by mining call records flowing into the data-warehouse 

daily, using an OLAP server. 
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2. Maintaining profiles by staging data between the data- 
warehouse and the OLAP multidimensional database. 

3. Deriving multilevel and multidimensional customer calling 
patterns from profiles for analysis. 

4. Comparing the similarity of customer calling patterns from 
volume and probability distribution points of view, and generating 
multilevel and multidimensional similarity measures, to be used in such 
applications as fraud detection. 

The profiling engine is built on top of an Oracle-8 based data- 
warehouse and Oracle Express, an OLAP server. Call data records, customer 
behavior profiles and other reference data are stored in warehouse 46 (of 
Fig. 1). Call data records are fed in daily and dumped to archive after use. 
OLAP server 14 (of Fig. 1) is used as a computation engine for creating and 
updating profiles, deriving calling patterns from profiles, as well as analyzing and 
comparing calling patterns. 

Figure 3 illustrates one exemplary data-warehoused OLAP based 
profiling engine architecture. According to the implementation depicted in 
Figure 3, customer profiling is repeated periodically. For example, customer 
profiling is implemented on a daily basis. More particularly, call records 67, in 
the form of call data 68, are loaded into call data tables 72 that are provided 
within data-warehouse 46. Call data tables 72 are then loaded into OLAP server 
14 in order to generate a "profile-snapshot" cube 74 that is multi-customer 
oriented. As shown in Figure 3, data-warehouse 46 also includes a profile table 
70. Additionally, OLAP server 14 is provided within profiling engine 16, which 
further includes OLAP engine 66 (which is formed by part of profiling engine 
16). 

Parallel with the above-described previous step, a profile cube 76 
is retrieved from data-warehouse 46 via OLAP server 14, and based upon 
information provided within data-warehouse 46. Profile cube 76 covers the 
same set of customers as were used to derive profile-snapshot cube 74. 
Subsequently, profile cube 76 is updated by merging profile cube 76 with 
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profile-snapshot cube 74 so as to form an updated profile cube 78. Updated 
profile cube 78 is then stored back in profile table 70 provided within data- 
warehouse 46 by way of feedback loop 88. The frequency of data exchange 
occurring between data-warehouse 46 and OLAP server 14 is controlled by 
specific data staging policies that are implemented therebetween. Updated 
profile cube 78 is further parsed into calling pattern cubes 80, 82, 84 and 86 of 
individual customers according to previously described techniques. Additionally, 
OLAP tools 52 covering reporting, analysis, and visualization techniques are 
provided within client 24. 

In order to reduce data redundancy and query cost, one goal is to 
maintain minimal data in the profile tables in the data-warehouse. Multiple 
customers calling information can be included in a single profile table or profile 
cube, without separating information on outgoing calls and incoming calls. The 
relational schema of the profile table can be made to directly correspond to the 
base level of the profile cube. Derivable values at higher levels are not 
maintained in data-warehouse 46 (of Fig. 3). 

Below is the Oracle-8 schema of profile table 70 (of Fig. 3), herein 
identified as ''Profile", where "pc" is the number of calls dimensioned by other 
attributes: 

Table 4 - Oracle-8 Table Schema 
// OracleS table schema 



CREATE TABLE Profile 
( 

caller 
callee 
duration 
time 
dow 
pc 



VARCHAR2(10) 
VARCHAR2(10) 
CHAR(I) 
CHAR(I) 
chard) 
INTEGER 



NOT NULL, 
NOT NULL, 
NOT NULL, 
NOT NULL, 
NOT NULL, 



) 



STORAGE 



The corresponding profile cube, PC, is defined in Oracle Express as 
shown below, where dow stands for day_of_week {e.g.r Monday,...., Sunday), 
as mentioned before: 



30 

// Oracle Express cube definition 

Define PC variable int < sparse < duration time dow callee cailer> > inplace 

It can be seen that the attributes of table "Profile" has one-to-one 
correspondence to the dimensions, including the output dinnension, of cube PC. 
This arrangement makes data staging straightforward. 

OLAP is used to handle the following tasks: 

1 . Retrieve profile tables to generate "profile" cubes. 

2. Retrieve call data tables to create "profile-snapshot" cubes. 
Profile-snapshot cubes and profile cubes have the same dimensions, with 
cell values representing number of calls. As a result, a profile-snapshot 
cube and a profile cube covering the same group of callers can be easily 
merged by addition. 

3. Derive individual customer based "calling pattern" cubes 
from profile cubes. Various kinds of pattern cubes are defined. By rolling 
up these cubes, multilevel calling patterns can be supported. 

4. Analyze individual calling patterns in multiple dimensions 
and at multiple levels. 

5. Compute the similarity of calling patterns that belong to 
different customers or belong to the same customer but cover different 
profiling periods. Using OLAP operations pattern similarity may be 
compared from different aspects {e.g,^ volume or probability distribution), 
in different dimensions {e,g,^ weekday behavior, or evening behavior 
regardless of which day of the week), and at different levels (e.g,^ daily 
or weekly). 

OLAP engine 66 (of Fig. 3) actually serves as a scalable 
computation engine. From a performance point of view, engine 66 supports 
indexed caching, reduces database access dramatically and extends main 
memory based reasoning. From a functionality point of view, engine 66 allows 
one to deliver powerful solutions for profiling, pattern generation, analysis and 
comparison, in a relatively simple and flexible manner. 
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IV. Profile Cubes and Calling Pattern Cubes 

As described above, two general kinds of cubes are introduced. 
First, multi-customer based profile cubes and, secondly, single customer based 
calling pattern cubes. 

5 

A. Profile Cubes 

A profile cube (PC) 76 (of Fig. 3) and a profile-snapshot cube 
(PCS) 74 (of Fig. 3) have the same underlying dimensions, and contain profiling 
information of multiple customers in direct correspondence with the relational 
10 tables in data-warehouse 46 (of Fig. 3). In the Oracle Express language they are 
defined as 

define PC variable int < sparse < duration time dow callee caller > > inplace 
define PCS variable int < sparse < duration time dow callee caller > > 
inplace 

15 

where callee and caller are called and calling numbers; dimension time has 
values 'morning', 'evening', etc, for time-bins; dimension duration has values 
representing duration-bins {e.g., 'short'); and dimension dow has values 
representing days of week (e.g., 'MON'). 

20 Note that the use of keyword "sparse" in the above definitions 

instructs Oracle Express to create a composite dimension <duration time dow 
callee caller> , in order to handle sparseness, particularly between calling and 
called numbers, in an efficient way. A composite dimension is a list of 
dimension-value combinations. A combination is an index into one or more 

25 sparse data cubes. The use of a composite dimension allows storing sparse data 
in a compact form similar to relation tuples. 

Figure 4 illustrates the population of profile-snapshot cube (PCS) 
74 by use of "binning". A call data record 50 contains fields 90, 92 and 94, 96 
with values mapping to each dimension of the PCS cube 74. Such mapping is 

30 referred to as "binning". For example, '9am' is mapped to time-bin 'morning', 5 
minutes is mapped to duration-bin 'short'. A call made at 9am and lasting 5 
minutes falls into the cell corresponding to time = 'morning' and duration = 
'short'. 
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As shown in Figure 3, profile cube (PC) 76 is retrieved from a 
database 89 in data-warehouse 46 and updated by merging with PCS 74, and 
then stored back into database 89. In Oracle Express, the merge of PC 76 and 
PCS 74 is simply expressed as: 



5 PC = PC -f PCS 

In this way customer profiles are updated incrementally as each 
new batch of call data records 67 flows into data-warehouse 46 (of Fig. 3). 

B. Hierarchical Dimensions for Multilevel Pattern Representation 

10 Calling pattern cubes are derived from profile cubes and used to 



represent the calling behavior of individual customers. In order to represent 
such calling behavior at multiple levels, dimensions dow, time and duration are 
defined as hierarchical dimensions, along which the calling pattern cubes can be 
rolled up. 

15 A hierarchical dimension D contains values at different levels of 

abstraction. Associated with D there are a dimension DL describing the levels of 
D, a relation DL D mapping each value of D to the appropriate level, and a 
relation D_D mapping each value of D to its parent value (the value at the 
immediate upper level). Let D be an underlying dimension of a numerical cube C 
20 such as a volume-based calling pattern cube. D, together with DZ., DL_D and 

D_D, fully specify a dimension hierarchy. They provide sufficient information to 
rollup cube C along dimension D; that Is, to calculate the total of cube data at 
the upper levels using the corresponding lower-level data. A cube may be rolled 
up along multiple underlying dimensions. 
25 1 . dow hierarchy 

The dow hierarchy is made of the following objects: 

a. dow(6ay of week): dimension with values 'MON', ... 
'SUN' at the lowest level (dd level), 'wkday', 'wkend' at a higher 
level (ww level), and 'week' at the top level ('week' level). 
30 b. dowLevel; dimension with values 'dd', 'ww', 'week' 



Case 10991149-1 



33 

c. dow_dow: relation (dow, dow) for mapping each 
value to its parent value; e.g., 

dow_dow(dow 'MON') == 'wkday' 

dow_dow{dow 'SAT') = 'wkend' 
dow_dow(dow 'wkday') = 'week' 
dow_dow(dow 'wkend') = 'week' 
dow_dow(dow 'week') = NA 

d. dowLevel_dow: relation (dow, dowLevel) for mapping 
each value to its level; e.g., 

dowLevel dowldow 'MON') = 'dd' 

dowLevel_dow{dow 'wkday') = 'ww' 
dowLevel_dow(dow 'wkend') = 'ww' 
dowLevel_dow(dow 'week') = 'week' 
2. time hierarchy 

The time hierarchy is made of the following objects. 

a. time: dimension with values 'night', 'morning', 
'afternoon' and 'evening' at 'time_bin' level (bottom-level), and 
'allday' at the 'time air level (top-level). 

b. timeLevel: dimension with values 'time_bin' and 
'time_^air. 

c. time_time: relation (time, time) for mapping each 
value to its parent value; e.g., 

time__time(time 'morning') = 'allday' 

time_time(time 'allday') = NA 

d. timeLevel_time: relation (time, timeLevel) for mapping 
each value to its level; e.g., 

timeLevel_time(time 'morning') = "time^bin' 
timeLevel_time(time 'aliday') = 'time^ali' 
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3. duration hierarchy 

The duration hierarchy is made of the following objects: 

a. duration-, dimension with values 'short', 'medium', 
'long' at 'dur_bin' level (bottom-level, and 'air and 'dur_air level 
(top-level) 

b. durLeveh dimension with values 'dur_bin' and 
'dur_all' 

c. dur_dur: relation (duration, duration) for mapping 
each value to its parent value; e.g., 

dur_dur(duration 'short') = 'air 

dur_dur(duration 'air) = NA 

d. durLevel dur: relation (duration, durLevel) for 
mapping each value to its level; e.g., 

durLevel_dur(duration 'short') = •dur_bin' 

durLevel_dur(duration 'air) = 'dur_air 
For profile storage, combination and updating, only the bottom 
levels are involved, therefore rolling up profile cubes such as PC is unnecessary. 
Rolling up is only applicable to calling pattern cubes for analysis purposes, 

C. Cubes Representing Volume-based Calling Patterns 

A calling pattern cube is associated with a single customer for 
representing the individual calling behavior of that customer. As the calling 
behavior of a customer may be viewed from different aspects, different kinds of 
calling pattern cubes may be defined. These cubes are commonly dimensioned 
by time, duration and dow (day of week), and in addition, for those related to 
outgoing calls, dimensioned by callee, and for those related to incoming calls, 
dimensioned by caller. Their cell values represent the number of calls, the 
probability distributions, etc. Calling pattern cubes are derived from profile 
cubes, say, PC, and then rolled up. Below are some examples. 
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1 . Volume based calling patterns 

Cube "CB.o" represents the outgoing calling behavior of a 
customer. In Oracle Express that representation is defined by: 

define CB.o variable int < sparse < duration time dow callee> > inplace 
Similarly, cube 'XB.d" representing incoming calling 

behavior is defined by: 

define CB.d variable int < sparse < duration time dow caller > > inplace 

The cell values of these cubes are the number of calls falling 
into the given 'slot' of time, duration, day of week, etc. When generated, 
cubes "CB.o" and "CB.d" are rolled up along dimensions duration, time 
and dow. Therefore, 

CB. ©(duration 'short', time 'morning', dow 'MON') 
measures the number of short-duration calls this customer made to each 
callee (dimensioned by callee) on Monday mornings during the profiling 

interval. Similarly, 

CB.o(duration 'air, time 'allday', dow 'week') 
measures the number of calls this customer made to each callee (total 
calls dimensioned by callee) during the profiling interval. 

2. An Algorithm for Deriving Calling Pattern Cube 

Below an algorithm is shown for generating the calling 
pattern cube ''CB.o" from profile cube PC defined above. Other calling 
pattern cubes can be generated in the similar way. 

Cube 'TC" is pre-populated using the data retrieved from 
database table "Profile" and possibly combined with cube PCS that is 
generated from loading call data. With the following algorithm, a calling 
pattern cube, "CB.o", is populated for a given customer as specified by 
parameter customer_calllD. 

Derive Calling Pattern Cubes Algorithm 

define genCB (customer_calllD text) 

{ 
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■■ if customer_calllD is not a value of caller then return 

- remove old cells of CB.o by 

limit dimensions duration, time, dow and callee to all their 

values 

CB.o = NA 

- limit dimensions duration, time, dow to their bottom level 
values 

- limit dimension caller to the given customer by 
limit caller to customer_calllD 

- limit dimension callee to those being called by the given 
customer, 

as limit callee to any(PC > 0, callee) 

- form a subcube of PC by selecting only the data related to 
the given customer (the current value of caller dimension), 
then transfer (unravel) its cell values to cube CB.o, as 
CB.o = unraveKtotaKPC, duration time dow callee) 

- rollup CB.o by 

limit duration, time, dow to ail their values 

rollup CB.o over duration using dur_dur 
rollup CB.o over time using time_time 
rollup CB.o over dow using dow_dow 

} 

D. Cubes Representing Probability Distribution based Calling Patterns 

Cubes representing probability distribution based calling patterns 
are derived from volume-based pattern cubes. Depending on application 
requirement various cubes may be derived. Two kinds of such cubes are listed 
below. 

1 . Probability Distribution on All Calls 

Cube "P_CB.o" for a customer represents the dimensioned 
probability distribution of outgoing calls over a// the outgoing calls made 
by this customer, and is derived from cube "CB.o" in the following way: 

define P_CB.o formula decimal <duration time dow callee> 
EQ (CB.o/total(CB.o(duration 'all', 'allday', dow 'week'))) 

where total(CB.o{duration 'all', 'allday', dow 'week')) is the total calls this 
customer made to all callees. Remember that cube "CB.o" has already 
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been rolled up, hence we can use its top-level value. The value of a cell is 
the above probability which corresponds to the underlying dimension 
values. 

2. Probability Distribution on Calls to Each Callee 

Cube 'T1_CB.o" is dimensioned by duration, ... and callee, 
and represents the probability distribution of a customer's outgoing calls 
over the customer's total calls to the corresponding callee, and is also 
derived from cube "CB.o" as specified in the following, 

define F^1_CB.o formula decimal <duration time dow callee> 
EQ {CB.o/total(CB.o{duration 'air, 'allday', dow 'week'), callee)) 

where "total{CB.o(duration 'all', 'allday', dow 'week'), callee)" is the total 
number of calls this customer made to each callee (dimensioned by 
callee). The value of a cell is the above probability corresponding to the 
underlying dimension values. 

Calling pattern cubes for incoming calls can be defined in a 
similar manner. 

V. Conclusions 

The problem of customer behavior profiling occurs in many 
applications such as telecommunications and electronic commerce. Pursuant to 
Applicant's techniques, a data-warehouse and OLAP based framework for 
customer behavior profiling has been described particularly with reference to its 
use in one specific telecommunication application. 

A prototype has been implemented on top of an Oracle-8 based 
data-warehouse with an Oracle Express OLAP server. This prototype has 
demonstrated the practical value of using OLAP server as a scalable 
computation engine for creating and updating profiles, deriving calling patterns 
from profiles, as well as analyzing and comparing calling patterns. 
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Additionally, tiie customer behavior profiling described above can 
be further scaled by developing a multilevel, distributed profiling framework 
incorporated with a dynamic agent infrastructure. 

Figure 5 discloses a logic flow diagram for the programming of 
general circuitry, or processing circuitry, present within profiling apparatus 10 of 
data mining system 18 (of Fig. 1). The following steps list a method for 
comparing customer behavior patterns. 

In Step "SI call data is provided in the form of call data records 
to a data-warehouse. After performing Step "SI", the process proceeds to 
Step "S2". 

In Step "S2'\ the process loads the call data records into an OLAP 
server. After performing Step "S2", the process proceeds to Step "S3". 

In Step "S3", the process generates a profile-snapshot cube 
accommodating multiple customers. After performing Step "S3", the process 
proceeds to Step "S4". 

In Step "S4", in combination with generating the profile-snapshot 
cube, the process generates a profile cube for the same set of customers from 
the data warehouse. After performing Step "S4", the process proceeds to 
Step "S5". 

In Step "S5", the process updates the profile cube by merging the 
profile cube with the profile-snapshot cube. After performing Step "S5", the 
process proceeds to Step "S6". 

In Step "S6", the process stores the updated profile cube in the 
data-warehouse. After performing Step "S6", the process is terminated. 

In compliance with the statute, the invention has been described in 
language more or less specific as to structural and methodical features. It is to 
be understood, however, that the invention is not limited to the specific features 
shown and described, since the means herein disclosed comprise preferred 
forms of putting the invention into effect. The invention is, therefore, claimed in 
any of its forms or modifications within the proper scope of the appended claims 
appropriately interpreted in accordance with the doctrine of equivalents. 
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CLAIMS 

What is claimed is: 

1 1 . A customer profiling apparatus for coriducting customer 

2 behavior pattern analysis, comprising: 

3 processing circuitry operative to process customer records; 

4 a data warehouse coupled with the processing circuitry and 

5 configured to store the processed customer records; 

6 a profiling engine communicating with the data warehouse and 

7 operative to build and update customer behavior profiles by mining the customer 

8 records that flow into the data warehouse; and 

9 at least one computer program, performed by the profiling engine, 

10 and operative to define behavior profiles as data cubes and derive similarity 

1 1 measures on patterns extracted from the behavior profiles. 

1 2. The apparatus of claim 1 wherein the profiling engine is an 

2 OnLine Analytical Processing (OLAP) based scalable and flexible profiling engine. 

1 3. The apparatus of claim 1 wherein the profiling engine 

2 comprises a commercial data warehouse server and a multi-dimensional OLAP 

3 server. 

1 4. The apparatus of claim 1 wherein the profiling engine 

2 implements multi-level, multi-dimensional pattern analysis and comparison. 

1 5. The apparatus of claim 1 wherein the behavior profiles are 

2 defined at least in part by probability distributions. 

1 6. The apparatus of claim 1 wherein similarity measures are 

2 defined and computed on the patterns extracted from the behavior profiles. 
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1 7. The apparatus of claim 1 wherein the computer program is 

2 further operative to compare the data cubes with similarity measures identifying 

3 fraud so as to extract fraud detection from the behavior profiles, 

1 8. The apparatus of claim 1 wherein the customer records 

2 comprise customer communication call records, and the behavior profiles are 

3 derived from telephone call data. 

1 9, The apparatus of claim 8 wherein the behavior profiles are 

2 analyzed to detect caller fraud. 

1 10. The apparatus of claim 1 wherein the customer records 

2 comprise customer call records, the profiling engine builds and updates 

3 customer calling behavior profiles by mining the customer call records, and the 

4 computer program derives similarity measures on patterns extracted from the 

5 call behavior profiles. 

1 1 1 . A profiling apparatus, comprising: 

2 a data warehouse for storing customer records; 

3 a profiling engine communicating with the data warehouse and 

4 operative to generate customer behavior profiles from the customer records 

5 within the data warehouse; and 

6 a computer application program implemented on the profiling 

7 engine and operative to represent behavior profiles as patterns and derive 

8 similarity measures of the patterns usable to profile customer behavior. 

1 12. The apparatus of claim 1 1 wherein the profiling engine is 

2 configured to define customer behavior profiles using probability distributions, 

3 and to compute the customer behavior profiles using OLAP operations on multi- 

4 dimensional and multi-level data cubes. 
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1 13, The apparatus of claim 12 wherein one multi-level data cube 

2 comprises a profile cube. 

1 14. The apparatus of claim 13 wherein another multi-level data 

2 cube comprises a profile-snapshot cube. 

1 15. The apparatus of claim 14 wherein yet another data cube 

2 comprises an updated profile cube formed by merging together the profile cube 

3 and the profile-snapshot cube. 

1 16. The apparatus of claim 15 wherein the updated profile cube 

2 is stored within a profile table of the data warehouse such that subsequent 

3 customer profiling utilizes customer records from the data warehouse comprising 

4 the updated profile cube. 

1 17. A method for comparing customer behavior patterns, 

2 comprising: 

3 providing call data in the form of call data records to a data 

4 warehouse; 

5 loading the call data records into an OLAP server; 

6 generating a profile-snapshot cube accommodating multiple 

7 customers; 

8 in combination with generating the profile-snapshot cube, 

9 generating a profile cube for the same set of customers from the data 

10 warehouse; 

1 1 updating the profile cube by merging the profile cube with the 

12 profile-snapshot cube; and 

13 storing the updated profile cube in the data warehouse. 

1 18. The method of claim 17 wherein the data warehouse 

2 comprises profile tables configured to store the profile cube. 
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^ The method of claim 17 wherein the updated profile cube is 

2 subdivided into a plurality of individual calling pattern cubes, each representative 

3 of individual customers. 

1 20. The method of claim 1 9 further comprising the step of 

2 performing at least one of reporting, analyzing, and visualizing of one of the 

3 calling pattern cubes for an individual customer. 
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ABSTRACT 

A customer profiling apparatus is provided for conducting customer 
behavior pattern analysis. The apparatus includes processing circuitry, a data 
warehouse, a profiling engine, and at least one computer program. The 
5 processing circuitry is operative to process customer records. The data 

warehouse is coupled with the processing circuitry and is configured to store the 
processed customer records. The profiling engine communicates with the data 
warehouse and is operative to build and update customer behavior profiles by 
mining the customer records that flow into the data warehouse. The at least 
10 one computer program is performed by the profiling engine, and is operative to 
define behavior profiles as data cubes and derive similarity measures on patterns 
extracted from the behavior profiles. A method is also provided. 
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